{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "25733da0",
   "metadata": {},
   "source": [
    "# MongoDB-Atlas\n",
    "\n",
    "- Author: [Ivy Bae](https://github.com/ivybae), [Jongho Lee](https://github.com/XaviereKU)\n",
    "- Peer Review : [Haseom Shin](https://github.com/IHAGI-c), [ro__o_jun](https://github.com/ro-jun), [Sohyeon Yim](https://github.com/sohyunwriter)\n",
    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/07-MongoDB.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/07-MongoDB.ipynb)\n",
    "\n",
    "## Overview\n",
    "\n",
    "This tutorial covers how to use ```MongoDB-Atlas``` with **LangChain**.\n",
    "\n",
    "[MongoDB Atlas](https://www.mongodb.com/en/atlas) is a multi-cloud database service that provides an easy way to host and manage your data in the cloud.\n",
    "\n",
    "This tutorial walks you through **CRUD** operations with ```MongoDB-Atlas```: **storing**, **updating**, and **deleting** documents, and performing **similarity-based retrieval**.\n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environment Setup](#environment-setup)\n",
    "- [What is MongoDB-Atlas?](#what-is-mongodb-atlas)\n",
    "- [Prepare Data](#prepare-data)\n",
    "- [Setting up MongoDB-Atlas](#setting-up-mongodb-atlas)\n",
    "- [Document Manager](#document-manager)\n",
    "\n",
    "\n",
    "### References\n",
    "\n",
    "- [Get Started with Atlas](https://www.mongodb.com/docs/atlas/getting-started/)\n",
    "- [Deploy a Free Cluster](https://www.mongodb.com/docs/atlas/tutorial/deploy-free-tier-cluster/)\n",
    "- [Connection Strings](https://www.mongodb.com/docs/manual/reference/connection-string/)\n",
    "- [Atlas Search and Vector Search Indexes](https://www.mongodb.com/docs/languages/python/pymongo-driver/current/indexes/atlas-search-index/)\n",
    "- [Review Atlas Search Index Syntax](https://www.mongodb.com/docs/atlas/atlas-search/index-definitions/)\n",
    "- [JSON and BSON](https://www.mongodb.com/resources/basics/json-and-bson)\n",
    "- [Write Data to MongoDB](https://www.mongodb.com/docs/languages/python/pymongo-driver/current/write-operations/)\n",
    "- [Read Data from MongoDB](https://www.mongodb.com/docs/languages/python/pymongo-driver/current/read/)\n",
    "- [Query Filter Documents](https://www.mongodb.com/docs/manual/core/document/#query-filter-documents)\n",
    "- [Update Operators](https://www.mongodb.com/docs/manual/reference/operator/update/)\n",
    "- [Integrate Atlas Vector Search with LangChain](https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/langchain/)\n",
    "- [Get Started with the LangChain Integration](https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/langchain/get-started/)\n",
    "- [Comparison Query Operators](https://www.mongodb.com/docs/manual/reference/operator/query-comparison/)\n",
    "- [MongoDB Atlas](https://python.langchain.com/docs/integrations/vectorstores/mongodb_atlas/)\n",
    "- [Document loaders](https://python.langchain.com/docs/concepts/document_loaders/)\n",
    "- [Text splitters](https://python.langchain.com/docs/concepts/text_splitters/)\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1fac085",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
    "\n",
    "**[Note]**\n",
    "- ```langchain-opentutorial``` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.\n",
    "- You can check out [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "98da7994",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "800c732b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "from langchain_opentutorial import package\n",
    "\n",
    "package.install(\n",
    "    [\n",
    "        \"langsmith\",\n",
    "        \"langchain-core\",\n",
    "        \"python-dotenv\",\n",
    "        \"langchain_openai\",\n",
    "        \"langchain_community\",\n",
    "        \"pymongo\",\n",
    "        \"certifi\",\n",
    "    ],\n",
    "    verbose=False,\n",
    "    upgrade=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5b36bafa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Environment variables have been set successfully.\n"
     ]
    }
   ],
   "source": [
    "# Set environment variables\n",
    "from langchain_opentutorial import set_env\n",
    "\n",
    "set_env(\n",
    "    {\n",
    "        \"OPENAI_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_TRACING_V2\": \"true\",\n",
    "        \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n",
    "        \"LANGCHAIN_PROJECT\": \"{Project Name}\",\n",
    "        \"MONGODB_ATLAS_CLUSTER_URI\": \"{Your Atlas URI}\",\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8011a0c7",
   "metadata": {},
   "source": [
    "You can alternatively set API keys such as ```OPENAI_API_KEY``` in a ```.env``` file and load them.\n",
    "\n",
    "[Note] This is not necessary if you've already set the required API keys in previous steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "70d7e764",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from dotenv import load_dotenv\n",
    "\n",
    "load_dotenv(override=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e73a30dd",
   "metadata": {},
   "source": [
    "To set ```MONGODB_ATLAS_CLUSTER_URI```, you need to sign up and create a cluster on [MongoDB Atlas](https://www.mongodb.com/en/atlas).\n",
    "\n",
    "You can get started with **Atlas** using either the [Atlas CLI](https://www.mongodb.com/docs/atlas/cli/current/atlas-cli-getting-started/) or the **Atlas UI**.\n",
    "\n",
    "The **Atlas CLI** can be difficult to use if you're not used to working with development tools, so this tutorial walks you through the **Atlas UI**.\n",
    "\n",
    "To deploy a cluster, select the appropriate project in your **Organization**. If the project doesn't exist, you'll need to create it.\n",
    "\n",
    "Once you select a project, you can create a cluster.\n",
    "\n",
    "![mongodb-atlas-project](./assets/07-mongodb-atlas-initialization-01.png)\n",
    "\n",
    "Follow the procedure below to deploy a cluster:\n",
    "\n",
    "- Select **Cluster**: the **M0** Free cluster option\n",
    "\n",
    "> Note: You can deploy only one Free cluster per Atlas project.\n",
    "\n",
    "- Select **Provider**: **M0** is available on AWS, GCP, and Azure\n",
    "\n",
    "- Select **Region**\n",
    "\n",
    "- Create a database user and configure your IP address settings.\n",
    "\n",
    "After you deploy a cluster, you can see the cluster you deployed as shown in the image below.\n",
    "\n",
    "![mongodb-atlas-cluster-deploy](./assets/07-mongodb-atlas-initialization-02.png)\n",
    "\n",
    "Click **Get connection string** in the image above to get the cluster URI.\n",
    "\n",
    "Now set the value of `MONGODB_ATLAS_CLUSTER_URI` in the `.env` file or set it directly inside the Set environment variables cell.\n",
    "\n",
    "The **connection string** resembles the following example:\n",
    "\n",
    "> mongodb+srv://[databaseUser]:[databasePassword]@[clusterName].[hostName].mongodb.net/?retryWrites=true&w=majority\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6890920d",
   "metadata": {},
   "source": [
    "## What is MongoDB-Atlas?\n",
    "\n",
    "[MongoDB Atlas](https://www.mongodb.com/en/atlas) is a multi-cloud database service that provides an easy way to host and manage your data in the cloud.\n",
    "\n",
    "It improves security by blocking connections from all IP addresses except those the user has approved.\n",
    "\n",
    "Both text-based search and vector-based similarity search are provided, and you can choose which fields to index for later use, such as pre-filtering.\n",
    "\n",
    "Because only the fields you choose are indexed, no space is wasted on unused indexes.\n",
    "\n",
    "You can change the settings for indexed fields, as well as other options, on the Atlas web page."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f3b5bd2",
   "metadata": {},
   "source": [
    "## Prepare Data\n",
    "\n",
    "This section guides you through the **data preparation process** .\n",
    "\n",
    "This section includes the following components:\n",
    "\n",
    "- Data Introduction\n",
    "\n",
    "- Preprocess Data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "508ae7f7",
   "metadata": {},
   "source": [
    "### Data Introduction\n",
    "\n",
    "In this tutorial, we will use the fairy tale **📗 The Little Prince** as our data.\n",
    "\n",
    "This material complies with the **Apache 2.0 license**.\n",
    "\n",
    "The data is provided as a text (.txt) file converted from the original PDF.\n",
    "\n",
    "You can view the data at the link below.\n",
    "- [Data Link](https://huggingface.co/datasets/sohyunwriter/the_little_prince)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "004ea4f4",
   "metadata": {},
   "source": [
    "### Preprocess Data\n",
    "\n",
    "In this tutorial section, we will preprocess the text data from The Little Prince and convert it into a list of ```LangChain Document``` objects with metadata. \n",
    "\n",
    "Each document chunk will include a ```title``` field in the metadata, extracted from the first line of each section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8e4cac64",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.documents import Document\n",
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "import re\n",
    "from typing import List\n",
    "\n",
    "\n",
    "def preprocessing_data(content: str) -> List[Document]:\n",
    "    # 1. Split the text by double newlines to separate sections\n",
    "    blocks = content.split(\"\\n\\n\")\n",
    "\n",
    "    # 2. Initialize the text splitter\n",
    "    text_splitter = RecursiveCharacterTextSplitter(\n",
    "        chunk_size=500,  # Maximum number of characters per chunk\n",
    "        chunk_overlap=50,  # Overlap between chunks to preserve context\n",
    "        separators=[\"\\n\\n\", \"\\n\", \" \"],  # Order of priority for splitting\n",
    "    )\n",
    "\n",
    "    documents = []\n",
    "\n",
    "    # 3. Loop through each section\n",
    "    for block in blocks:\n",
    "        lines = block.strip().splitlines()\n",
    "        if not lines:\n",
    "            continue\n",
    "\n",
    "        # Extract title from the first line using square brackets [ ]\n",
    "        first_line = lines[0]\n",
    "        title_match = re.search(r\"\\[(.*?)\\]\", first_line)\n",
    "        title = title_match.group(1).strip() if title_match else \"\"\n",
    "\n",
    "        # Remove the title line from content\n",
    "        body = \"\\n\".join(lines[1:]).strip()\n",
    "        if not body:\n",
    "            continue\n",
    "\n",
    "        # 4. Chunk the section using the text splitter\n",
    "        chunks = text_splitter.split_text(body)\n",
    "\n",
    "        # 5. Create a LangChain Document for each chunk with the same title metadata\n",
    "        for chunk in chunks:\n",
    "            documents.append(Document(page_content=chunk, metadata={\"title\": title}))\n",
    "\n",
    "    print(f\"Generated {len(documents)} chunked documents.\")\n",
    "\n",
    "    return documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "1d091a51",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generated 262 chunked documents.\n"
     ]
    }
   ],
   "source": [
    "# Load the entire text file\n",
    "with open(\"./data/the_little_prince.txt\", \"r\", encoding=\"utf-8\") as f:\n",
    "    content = f.read()\n",
    "\n",
    "# Preprocess Data\n",
    "docs = preprocessing_data(content=content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1977d4ff",
   "metadata": {},
   "source": [
    "## Setting up MongoDB-Atlas\n",
    "\n",
    "This part walks you through the initial setup of ```MongoDB-Atlas```.\n",
    "\n",
    "This section includes the following components:\n",
    "\n",
    "- Load Embedding Model\n",
    "\n",
    "- Load ```MongoDB-Atlas``` Client\n",
    "\n",
    "- Create Collection\n",
    "\n",
    "- Create Vector Search Index"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7eee56b2",
   "metadata": {},
   "source": [
    "### Load Embedding Model\n",
    "\n",
    "In this section, you'll learn how to load an embedding model.\n",
    "\n",
    "This tutorial uses an **OpenAI API key** to load the model.\n",
    "\n",
    "*💡 If you prefer to use another embedding model, see the link below.*\n",
    "- [Embedding Models](https://python.langchain.com/docs/integrations/text_embedding/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "5bd5c3c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from langchain_openai import OpenAIEmbeddings\n",
    "\n",
    "embedding = OpenAIEmbeddings(model=\"text-embedding-3-large\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40f65795",
   "metadata": {},
   "source": [
    "### Load MongoDB-Atlas Client\n",
    "\n",
    "In this section, we'll show you how to load the **database client object** using the **Python SDK** for ```MongoDB-Atlas```.\n",
    "- [Python SDK Docs](https://www.mongodb.com/docs/languages/python/pymongo-driver/current/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "eed0ebad",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create Database Client Object Function\n",
    "from pymongo import MongoClient\n",
    "import certifi\n",
    "\n",
    "\n",
    "def get_db_client(URI):\n",
    "    \"\"\"\n",
    "    Initializes and returns a MongoDB client instance.\n",
    "\n",
    "    Args:\n",
    "        URI: The MongoDB Atlas connection string.\n",
    "\n",
    "    Returns:\n",
    "        MongoClient: A client for the Atlas cluster that verifies TLS\n",
    "        connections with certifi's CA bundle.\n",
    "    \"\"\"\n",
    "    client = MongoClient(URI, tlsCAFile=certifi.where())\n",
    "    return client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "2b5f4116",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get DB Client Object\n",
    "URI = os.getenv(\"MONGODB_ATLAS_CLUSTER_URI\")\n",
    "client = get_db_client(URI)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e0a54d1",
   "metadata": {},
   "source": [
    "### Create Collection\n",
    "\n",
    "Once you are successfully connected to ```MongoDB-Atlas```, a sample collection is already available.\n",
    "\n",
    "In this tutorial, however, we will create a new collection with ```MongoDBAtlasCollectionManager```."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "891a6995",
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils.mongodb_atlas import MongoDBAtlasCollectionManager\n",
    "\n",
    "# Get collectionManager\n",
    "collectionManager = MongoDBAtlasCollectionManager(\n",
    "    db_name=\"langchain-opentutorial-db\", client=client\n",
    ")\n",
    "\n",
    "# Create new collection\n",
    "collectionManager.create_collection(\"little-prince\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d63e26b",
   "metadata": {},
   "source": [
    "After you create a collection, you can check it on the Atlas web page.\n",
    "![mongodb-atlas-collection](./assets/07-mongodb-atlas-database.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf8d1198",
   "metadata": {},
   "source": [
    "### Create Vector Search Index\n",
    "\n",
    "To perform vector search in Atlas, you must create an **Atlas Vector Search Index**.\n",
    "\n",
    "First, define either an **Atlas Search Index** or an **Atlas Vector Search Index** using a `SearchIndexModel` object.\n",
    "\n",
    "- `definition` : the definition of the **Search Index**.\n",
    "\n",
    "- `name` : the name used to query the **Search Index**.\n",
    "\n",
    "To learn more about the `definition` of a `SearchIndexModel`, see [Review Atlas Search Index Syntax](https://www.mongodb.com/docs/atlas/atlas-search/index-definitions/).\n",
    "\n",
    "**[NOTE]**\n",
    "\n",
    "If you want to use a metadata field as a filter, you must declare it when you create the index.\n",
    "\n",
    "In the following example, ```title``` is declared as a filter field in ```vector_index``` but not in ```search_index```."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "1b7e2efe",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pymongo.operations import SearchIndexModel\n",
    "\n",
    "TEST_SEARCH_INDEX_NAME = \"test_search_index\"\n",
    "TEST_VECTOR_SEARCH_INDEX_NAME = \"test_vector_index\"\n",
    "\n",
    "search_index = SearchIndexModel(\n",
    "    definition={\n",
    "        \"mappings\": {\"dynamic\": True},\n",
    "    },\n",
    "    name=TEST_SEARCH_INDEX_NAME,\n",
    ")\n",
    "\n",
    "vector_index = SearchIndexModel(\n",
    "    definition={\n",
    "        \"fields\": [\n",
    "            {\n",
    "                \"type\": \"vector\",\n",
    "                \"numDimensions\": len(embedding.embed_query(\"Hello\")),\n",
    "                \"path\": \"embedding\",\n",
    "                \"similarity\": \"cosine\",\n",
    "            },\n",
    "            {\n",
    "                \"type\": \"filter\",\n",
    "                \"path\": \"title\"\n",
    "            }\n",
    "        ]\n",
    "    },\n",
    "    name=TEST_VECTOR_SEARCH_INDEX_NAME,\n",
    "    type=\"vectorSearch\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "619aae20",
   "metadata": {},
   "source": [
    "Now we can create the **indexes** based on the ```SearchIndexModel``` objects defined above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "bddcfa9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create the actual indexes\n",
    "collectionManager.create_index(TEST_SEARCH_INDEX_NAME, search_index)\n",
    "collectionManager.create_index(TEST_VECTOR_SEARCH_INDEX_NAME, vector_index)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9785ec0f",
   "metadata": {},
   "source": [
    "After you create the indexes, you can check them in the search tab of the Atlas web page.\n",
    "\n",
    "![mongodb-atlas-search-index](./assets/07-mongodb-atlas-search-index-01.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a5a97a0",
   "metadata": {},
   "source": [
    "## Document Manager\n",
    "\n",
    "To support the **Langchain-Opentutorial**, we implemented a custom set of **CRUD** functionalities for VectorDBs.\n",
    "\n",
    "The following operations are included:\n",
    "\n",
    "- ```upsert``` : Update existing documents or insert if they don’t exist\n",
    "\n",
    "- ```upsert_parallel``` : Perform upserts in parallel for large-scale data\n",
    "\n",
    "- ```search``` : Search for similar documents based on embeddings\n",
    "\n",
    "- ```delete``` : Remove documents based on filter conditions\n",
    "\n",
    "Each of these features is implemented as class methods specific to each VectorDB.\n",
    "\n",
    "In this tutorial, you can easily utilize these methods to interact with your VectorDB.\n",
    "\n",
    "*We plan to continuously expand the functionality by adding more common operations in the future.*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65a40601",
   "metadata": {},
   "source": [
    "### Create Instance\n",
    "\n",
    "First, we create an instance of the **MongoDB-Atlas** helper class to use its CRUD functionalities.\n",
    "\n",
    "This class is initialized with the **MongoDB-Atlas Python SDK client instance** and the **embedding model instance** , both of which were defined in the previous section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "dccab807",
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils.mongodb_atlas import MongoDBAtlasDocumentManager\n",
    "\n",
    "crud_manager = MongoDBAtlasDocumentManager(\n",
    "    client=client,\n",
    "    db_name=\"langchain-opentutorial-db\",\n",
    "    collection_name=\"little-prince\",\n",
    "    embedding=embedding,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1c0c67f",
   "metadata": {},
   "source": [
    "Now you can use the following **CRUD** operations with the ```crud_manager``` instance.\n",
    "\n",
    "This instance allows you to easily manage documents in your ```MongoDB-Atlas``` collection."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c6c53c5",
   "metadata": {},
   "source": [
    "### Upsert Document\n",
    "\n",
    "**Update** existing documents or **insert** if they don’t exist\n",
    "\n",
    "**✅ Args**\n",
    "\n",
    "- ```texts``` : Iterable[str] – List of text contents to be inserted/updated.\n",
    "\n",
    "- ```metadatas``` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).\n",
    "\n",
    "- ```ids``` : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.\n",
    "\n",
    "- ```**kwargs``` : Extra arguments for the underlying vector store.\n",
    "\n",
    "**🔄 Return**\n",
    "\n",
    "- None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "f3a6c32b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from uuid import uuid4\n",
    "\n",
    "ids = [str(uuid4()) for _ in docs]\n",
    "\n",
    "args = {\n",
    "    \"texts\": [doc.page_content for doc in docs[:2]],\n",
    "    \"metadatas\": [doc.metadata for doc in docs[:2]],\n",
    "    \"ids\": ids[:2],\n",
    "    # Add additional parameters if you need\n",
    "}\n",
    "crud_manager.upsert(**args)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "278fe1ed",
   "metadata": {},
   "source": [
    "### Upsert Parallel\n",
    "\n",
    "Perform **upsert** in **parallel** for large-scale data\n",
    "\n",
    "**✅ Args**\n",
    "\n",
    "- ```texts``` : Iterable[str] – List of text contents to be inserted/updated.\n",
    "\n",
    "- ```metadatas``` : Optional[List[Dict]] – List of metadata dictionaries for each text (optional).\n",
    "\n",
    "- ```ids``` : Optional[List[str]] – Custom IDs for the documents. If not provided, IDs will be auto-generated.\n",
    "\n",
    "- ```batch_size``` : int – Number of documents per batch (default: 32).\n",
    "\n",
    "- ```workers``` : int – Number of parallel workers (default: 10).\n",
    "\n",
    "- ```**kwargs``` : Extra arguments for the underlying vector store.\n",
    "\n",
    "**🔄 Return**\n",
    "\n",
    "- None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "a89dd8e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reuse the ids generated in the previous upsert cell\n",
    "\n",
    "args = {\n",
    "    \"texts\": [doc.page_content for doc in docs],\n",
    "    \"metadatas\": [doc.metadata for doc in docs],\n",
    "    \"ids\": ids,\n",
    "    # Add additional parameters if you need\n",
    "}\n",
    "\n",
    "crud_manager.upsert_parallel(**args)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6beea197",
   "metadata": {},
   "source": [
    "### Similarity Search\n",
    "\n",
    "Search for **similar documents** based on **embeddings**.\n",
    "\n",
    "This method uses **\"cosine similarity\"**.\n",
    "\n",
    "\n",
    "**✅ Args**\n",
    "\n",
    "- ```query``` : str – The text query for similarity search.\n",
    "\n",
    "- ```k``` : int – Number of top results to return (default: 10).\n",
    "\n",
    "- ```filters``` : dict – Pre-filter applied to the similarity search, consisting of a ```field```, an ```operator``` and a ```value``` (default: None).\n",
    "\n",
    "- ```**kwargs``` : Additional search options (e.g., ```vector_index```).\n",
    "\n",
    "**🔄 Return**\n",
    "\n",
    "- ```results``` : List[Dict] – A list of matching documents ranked by similarity, each with fields such as ```title```, ```page_content``` and ```score```.\n",
    "\n",
    "\n",
    "To build a filter, pass a dictionary of the following form:\n",
    "```python\n",
    "{\"field\": {\"operator\": \"value\"}}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "5859782b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Rank 0 | Title : Chapter 4\n",
      "Contents : If I have told you these details about the asteroid, and made a note of its number for you, it is on account of the grown-ups and their ways. When you tell them that you have made a new friend, they never ask you any questions about essential matters. They never say to you, \"What does his voice sound like? What games does he love best? Does he collect butterflies?\" Instead, they demand: \"How old is he? How many brothers has he? How much does he weigh? How much money does his father make?\" Only\n",
      "Score : 0.6709620356559753\n",
      "\n",
      "Rank 1 | Title : Chapter 13\n",
      "Contents : \"Eh? Are you still there? Five-hundred-and-one million-- I can‘t stop... I have so much to do! I am concerned with matters of consequence. I don‘t amuse myself with balderdash. Two and five make seven...\" \n",
      "\"Five-hundred-and-one million what?\" repeated the little prince, who never in his life had let go of a question once he had asked it.\n",
      "The businessman raised his head.\n",
      "Score : 0.663661777973175\n",
      "\n",
      "Rank 2 | Title : Antoine de Saiot-Exupery\n",
      "Contents : For Saint-Exupéry, it was a grand adventure - one with dangers lurking at every corner. Flying his open cockpit biplane, Saint-Exupéry had to fight the desert's swirling sandstorms. Worse, still, he ran the risk of being shot at by unfriendly tribesmen below. Saint-Exupéry couldn't have been more thrilled. Soaring across the Sahara inspired him to spend his nights writing about his love affair with flying.\n",
      "Score : 0.662699818611145\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Search by Query\n",
    "results = crud_manager.search(\n",
    "    query=\"What is essential is invisible to the eye.\",\n",
    "    k=3,\n",
    "    vector_index=TEST_VECTOR_SEARCH_INDEX_NAME,\n",
    ")\n",
    "for idx, doc in enumerate(results):\n",
    "    print(f\"Rank {idx} | Title : {doc['title']}\")\n",
    "    print(f\"Contents : {doc['page_content']}\")\n",
    "    print(f\"Score : {doc['score']}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf8dd5c4",
   "metadata": {},
   "source": [
    "Now, let us create a filter that restricts the search to chunks whose **title equals Chapter 4**.\n",
    "\n",
    "```MongoDBAtlas``` supports the following operators.\n",
    "\n",
    "|Operator|Type|Description|\n",
    "|---|---|---|\n",
    "|```$eq```|Equality|equal to the value|\n",
    "|```$ne```|Equality|not equal to the value|\n",
    "|```$gt```|Range|greater than the value|\n",
    "|```$lt```|Range|less than the value|\n",
    "|```$gte```|Range|greater than or equal to the value|\n",
    "|```$lte```|Range|less than or equal to the value|\n",
    "|```$in```|Inclusive|included among the values|\n",
    "|```$nin```|Inclusive|not included among the values|\n",
    "|```$not```|Logical|logical not|\n",
    "|```$nor```|Logical|logical nor|\n",
    "|```$and```|Logical|logical and|\n",
    "|```$or```|Logical|logical or|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "2577dd4a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Rank 0 | Title : Chapter 4\n",
      "Contents : Grown-ups are like that... \n",
      "Fortunately, however, for the reputation of Asteroid B-612, a Turkish dictator made a law that his subjects, under pain of death, should change to European costume. So in 1920 the astronomer gave his demonstration all over again, dressed with impressive style and elegance. And this time everybody accepted his report. \n",
      "(picture)\n",
      "Score : 0.8311741352081299\n",
      "\n",
      "Rank 1 | Title : Chapter 4\n",
      "Contents : - the narrator speculates as to which asteroid from which the little prince came　　\n",
      "I had thus learned a second fact of great importance: this was that the planet the little prince came from was scarcely any larger than a house!\n",
      "Score : 0.8178739547729492\n",
      "\n",
      "Rank 2 | Title : Chapter 4\n",
      "Contents : If you were to say to the grown-ups: \"I saw a beautiful house made of rosy brick, with geraniums in the windows and doves on the roof,\" they would not be able to get any idea of that house at all. You would have to say to them: \"I saw a house that cost $20,000.\" Then they would exclaim: \"Oh, what a pretty house that is!\"\n",
      "Score : 0.7436615228652954\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Create Filter\n",
    "filters = {\"title\": {\"$eq\": \"Chapter 4\"}}\n",
    "\n",
    "# Filter Search\n",
    "results = crud_manager.search(\n",
    "    query=\"Which asteroid did the little prince come from?\",\n",
    "    k=3,\n",
    "    filters=filters,\n",
    "    vector_index=TEST_VECTOR_SEARCH_INDEX_NAME,\n",
    ")\n",
    "for idx, doc in enumerate(results):\n",
    "    print(f\"Rank {idx} | Title : {doc['title']}\")\n",
    "    print(f\"Contents : {doc['page_content']}\")\n",
    "    print(f\"Score : {doc['score']}\")\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ad0ed0c",
   "metadata": {},
   "source": [
    "### Delete Document\n",
    "\n",
    "Remove documents by ID or based on filter conditions.\n",
    "\n",
    "**✅ Args**\n",
    "\n",
    "- ```ids``` : Optional[List[str]] – List of document IDs to delete. If None, deletion is based on filter.\n",
    "\n",
    "- ```filters``` : Optional[Dict] – Dictionary specifying filter conditions (e.g., metadata match).\n",
    "\n",
    "- ```**kwargs``` : Any additional parameters.\n",
    "\n",
    "**🔄 Return**\n",
    "\n",
    "- None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "0e3a2c33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete by ids\n",
    "ids_to_delete = ids[:2]  # The ids you want to delete\n",
    "crud_manager.delete(ids=ids_to_delete)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "60bcb4cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete by ids with filters\n",
    "# Pass the ids together with a filter condition (here, title equals \"Chapter 6\")\n",
    "crud_manager.delete(ids=ids, filters={\"title\": {\"$eq\": \"Chapter 6\"}})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "30d42d2e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete All\n",
    "crud_manager.delete()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain-opentutorial-FtaFqYLT-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
